1 Learning Outcomes

1.2 Context

  • Text Mining can be considered a process for extracting insights from text
  • Computer-based text mining has been around since the 1950s with automated machine translation, or the 1940s if you count efforts to break codes
  • The CRAN Task View: Natural Language Processing (NLP) lists over 100 different resources, including 58 packages, focused on gathering, organizing, modeling, and analyzing text.

  • In addition to text mining or analysis, NLP has multiple areas of research and application. Top 10 Applications of NLP:
  1. Machine Translation: translation without any human intervention
  2. Speech Recognition: Alexa, Hey Google, Siri, … understanding your questions
  3. Sentiment Analysis: also known as opinion mining or emotion AI
  4. Question Answering: Alexa, Hey Google, Siri, … answering your questions
  5. Automatic Summarization: Reducing large volumes to meta-data or sensible summaries
  6. Chat bots: Combinations of 2 and 4 with short-term memory and context for specific domains
  7. Market Intelligence: Automated analysis of your searches, posts, tweets, …
  8. Text Classification: Automatically analyze text and then assign a set of pre-defined tags or categories based on its content e.g., organizing and determining relevance of reference material
  9. Character Recognition
  10. Spelling and Grammar Checking

2 Tidy Text

2.1 Tidy Text Format

There are multiple ways to organize text for analysis:

  • strings
  • corpus (a library of documents): collections of raw strings, often from many documents, annotated with metadata
  • Document-Term Matrix (DTM): a matrix with one row per document and one column per term (word). The entries are generally counts or tf-idf (term frequency - inverse document frequency) scores for the column’s word in the row’s document. With many documents and terms, most entries are 0, so the matrix is usually stored in sparse form.
    • The Term-Document Matrix (TDM) is the transpose of the DTM
  • We will focus on the tidy text format: a table (tibble) with one token per row, where a token is a meaningful unit of text such as a word, an n-gram (multiple words), a sentence, or a paragraph, on up to whole chapters or books.

  • Organizing single words or n-grams without any sense of order is often referred to as a “Bag of Words”, as each token is treated independently of the other tokens in the document; only the counts or tf-idf scores matter.

  • More sophisticated methods now use neural word embeddings, where words are encoded into vectors that attempt to capture (get trained on) the context from other (near/far) words in the document. Word2Vec and Google’s BERT are two examples; see Beyond Word Embeddings Part 2.
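To make the formats above concrete, here is a minimal sketch (assuming the dplyr and tidytext packages are installed, and using an invented two-document example) that casts a tidy one-token-per-row table into a DTM and computes tf-idf scores:

```r
library(dplyr)
library(tidytext)

# Invented example: per-document word counts, already in tidy
# (one row per document-word pair) format
tidy_counts <- tibble(
  document = c("d1", "d1", "d2", "d2"),
  word     = c("moon", "fire", "fire", "ash"),
  n        = c(2, 1, 3, 1)
)

# Cast to a Document-Term Matrix: rows are documents, columns are terms,
# entries are counts, stored sparsely
dtm <- tidy_counts %>% cast_dtm(document, word, n)

# tf-idf reweights counts by how specific each word is to a document;
# "fire" appears in both documents, so its idf (and tf-idf) is 0
tidy_counts %>% bind_tf_idf(word, document, n)
```

tidytext also supplies a tidy() method for DTM objects, so the two representations are interchangeable.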

2.2 Example 1

  • Read in the following text

    text <- c("If You Forget Me",
    "by Pablo Neruda",
    "I want you to know",
    "one thing.",
    "You know how this is:",
    "if I look",
    "at the crystal moon, at the red branch",
    "of the slow autumn at my window,",
    "if I touch",
    "near the fire",
    "the impalpable ash",
    "or the wrinkled body of the log,",
    "everything carries me to you,",
    "as if everything that exists,",
    "aromas, light, metals,",
    "were little boats",
    "that sail",
    "toward those isles of yours that wait for me."
    )
    text
    ##  [1] "If You Forget Me"                             
    ##  [2] "by Pablo Neruda"                              
    ##  [3] "I want you to know"                           
    ##  [4] "one thing."                                   
    ##  [5] "You know how this is:"                        
    ##  [6] "if I look"                                    
    ##  [7] "at the crystal moon, at the red branch"       
    ##  [8] "of the slow autumn at my window,"             
    ##  [9] "if I touch"                                   
    ## [10] "near the fire"                                
    ## [11] "the impalpable ash"                           
    ## [12] "or the wrinkled body of the log,"             
    ## [13] "everything carries me to you,"                
    ## [14] "as if everything that exists,"                
    ## [15] "aromas, light, metals,"                       
    ## [16] "were little boats"                            
    ## [17] "that sail"                                    
    ## [18] "toward those isles of yours that wait for me."

2.2.1 Create a tibble with line indicators

  • Turn it into a tibble with a variable for the line and one for the text

    text_df <- tibble(
      line = 1:length(text),
      text = text
    )
    text_df
    ## # A tibble: 18 x 2
    ##     line text                                         
    ##    <int> <chr>                                        
    ##  1     1 If You Forget Me                             
    ##  2     2 by Pablo Neruda                              
    ##  3     3 I want you to know                           
    ##  4     4 one thing.                                   
    ##  5     5 You know how this is:                        
    ##  6     6 if I look                                    
    ##  7     7 at the crystal moon, at the red branch       
    ##  8     8 of the slow autumn at my window,             
    ##  9     9 if I touch                                   
    ## 10    10 near the fire                                
    ## 11    11 the impalpable ash                           
    ## 12    12 or the wrinkled body of the log,             
    ## 13    13 everything carries me to you,                
    ## 14    14 as if everything that exists,                
    ## 15    15 aromas, light, metals,                       
    ## 16    16 were little boats                            
    ## 17    17 that sail                                    
    ## 18    18 toward those isles of yours that wait for me.

2.2.2 Tidy Text the tibble with unnest_tokens()

  • text_df is not in tidy text format, so use unnest_tokens() to convert it

    text_df %>%
      unnest_tokens(word, text)
    ## # A tibble: 80 x 2
    ##     line word  
    ##    <int> <chr> 
    ##  1     1 if    
    ##  2     1 you   
    ##  3     1 forget
    ##  4     1 me    
    ##  5     2 by    
    ##  6     2 pablo 
    ##  7     2 neruda
    ##  8     3 i     
    ##  9     3 want  
    ## 10     3 you   
    ## # … with 70 more rows
  • Note: Punctuation has been stripped.
  • By default, unnest_tokens() converts the tokens to lowercase. (Use to_lower = FALSE to retain case.)
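As a quick sketch of those options (on a one-line invented example), to_lower = FALSE keeps the original case, and token = "ngrams" with n = 2 produces bigrams instead of single words:

```r
library(dplyr)
library(tidytext)

case_demo <- tibble(line = 1, text = "If You Forget Me")

# to_lower = FALSE retains the original case (the default lowercases)
case_demo %>% unnest_tokens(word, text, to_lower = FALSE)
# word: "If" "You" "Forget" "Me"

# Tokenize into bigrams (two-word n-grams) instead of single words
case_demo %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
# bigram: "if you" "you forget" "forget me"
```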

2.3 Remove Stop Words with an anti_join() on stop_words

  • You also see a lot of common words such as “I”, “the”, “and”, “or”, …
  • These are called “stop words”: extremely common words not useful for an analysis
  • Load stop_words, a tidytext data frame of 1,149 stop words based on three different lexicons
  • Use an anti_join() to remove the stopwords (return all rows from x where there are not matching values in y)
  • Get the word counts
  • Save to a new tibble

    data(stop_words)
    text_df %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words) %>% # get rid of uninteresting words
      count(word, sort = TRUE) -> # count of each word left
    text_word_count
    text_word_count # note: only 26 rows instead of 80
    ## # A tibble: 26 x 2
    ##    word        n
    ##    <chr>   <int>
    ##  1 aromas      1
    ##  2 ash         1
    ##  3 autumn      1
    ##  4 boats       1
    ##  5 body        1
    ##  6 branch      1
    ##  7 carries     1
    ##  8 crystal     1
    ##  9 exists      1
    ## 10 fire        1
    ## # … with 16 more rows

2.4 Example 2: Jane Austen

  • Let’s look at a larger set of text, say all of Jane Austen’s six major novels.
  • Get the data from the janeaustenr package. You may need to install the package.

    library(janeaustenr)
  • Use austen_books() to access the data frame of the books which has two columns:
    • text contains the text of the novels divided into elements of up to about 70 characters each
    • book contains the titles of the novels as a factor in order of publication.
  • Group by book, add row numbers, find the chapters, and save to a new data frame with chapter, linenumber, and text.

    austen_books() %>% 
      group_by(book) %>%
      mutate(linenumber = row_number(),
             chapter = cumsum(str_detect(text, 
                 regex("^chapter [\\divxlc]", 
                       ignore_case = TRUE)))) %>%
      ungroup() %>% 
      select(chapter, linenumber, everything()) ->
          orig_books
    orig_books
    ## # A tibble: 73,422 x 4
    ##    chapter linenumber text                    book               
    ##      <int>      <int> <chr>                   <fct>              
    ##  1       0          1 "SENSE AND SENSIBILITY" Sense & Sensibility
    ##  2       0          2 ""                      Sense & Sensibility
    ##  3       0          3 "by Jane Austen"        Sense & Sensibility
    ##  4       0          4 ""                      Sense & Sensibility
    ##  5       0          5 "(1811)"                Sense & Sensibility
    ##  6       0          6 ""                      Sense & Sensibility
    ##  7       0          7 ""                      Sense & Sensibility
    ##  8       0          8 ""                      Sense & Sensibility
    ##  9       0          9 ""                      Sense & Sensibility
    ## 10       1         10 "CHAPTER 1"             Sense & Sensibility
    ## # … with 73,412 more rows
  • Convert to Tidy Text format
  • Get rid of formatting symbols from the Gutenberg Library symbology, e.g., “_text_” means text is italicized
  • Remove Stop Words
  • Look at the counts

    orig_books %>%
      unnest_tokens(word, text) %>%
      # use str_extract to keep just the words, dropping formatting characters
      mutate(word = str_extract(word, "[a-z']+")) %>%
      anti_join(stop_words)  ->
      tidy_books
    
    tidy_books %>%
      count(word, sort = TRUE)
    ## # A tibble: 13,464 x 2
    ##    word       n
    ##    <chr>  <int>
    ##  1 miss    1860
    ##  2 time    1339
    ##  3 fanny    862
    ##  4 dear     822
    ##  5 lady     819
    ##  6 sir      807
    ##  7 day      797
    ##  8 emma     787
    ##  9 sister   727
    ## 10 house    699
    ## # … with 13,454 more rows
  • Plot the most common words in total in descending order

    tidy_books %>%
      count(word, sort = TRUE) %>%
      filter(n > 400) %>%
      mutate(word = reorder(word,n)) %>% 
      ggplot(aes(word, n)) +
        geom_col() +
        xlab(NULL) +
        coord_flip()

2.4.1 Exercise: Plot the most common words in total in descending order indicating the contribution of each book.

2.4.2 Exercise. Find the words that occur the most in each book that do not occur in any other book.

  • Hint: Consider using pivot_wider and pivot_longer

2.5 Compare Frequencies across Authors

  • Let’s compare Jane Austen with science fiction writer H.G. Wells (The Island of Doctor Moreau, The War of the Worlds, The Time Machine, and The Invisible Man) and with the Bronte Sisters (https://en.wikipedia.org/wiki/Bront%C3%AB_family) (Jane Eyre, Wuthering Heights, Agnes Grey, The Tenant of Wildfell Hall, and Villette), since they are from a similar era as Jane Austen.

2.5.1 Project Gutenberg and the gutenbergr package

  • We’ll use Project Gutenberg as our source
  • The gutenbergr package includes metadata for all Project Gutenberg works, so they can be searched and retrieved.
  • Install package if necessary and load the library

    library(gutenbergr)
  • The function gutenberg_download() downloads one or more works from Project Gutenberg
    • To download Frankenstein by gutenberg_ID, use gutenberg_download(84).
  • To find a work’s gutenberg_ID, use function gutenberg_works()
    • You can search on the exact title, or look for the author in the gutenberg_authors metadata data frame and then find the work IDs for that author.

      gutenberg_works() %>%
        filter(title == "Wuthering Heights")
      ## # A tibble: 1 x 8
      ##   gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
      ##          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
      ## 1          768 Wuth… Bront…              405 en       Gothic Fiction/… Publi…
      ## # … with 1 more variable: has_text <lgl>
      #or use str_detect
      gutenberg_works() %>%
        filter(str_detect(title,"Wuthering Heights") )%>% head()
      ## # A tibble: 2 x 8
      ##   gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
      ##          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
      ## 1          768 "Wut… Bront…              405 en       Gothic Fiction/… Publi…
      ## 2        40655 "The… Malha…            40751 en       <NA>             Publi…
      ## # … with 1 more variable: has_text <lgl>
      # or Find the author ID and then the work IDs
      gutenberg_authors[(str_detect(gutenberg_authors$author, "Wells")),]
      ## # A tibble: 19 x 7
      ##    gutenberg_author… author    alias  birthdate deathdate wikipedia   aliases   
      ##                <int> <chr>     <chr>      <int>     <int> <chr>       <chr>     
      ##  1                30 Wells, H… Wells…      1866      1946 http://en.… <NA>      
      ##  2               135 Brown, W… <NA>          NA      1884 http://en.… Brown, W.…
      ##  3              1060 Wells, C… Hough…      1862      1942 http://en.… <NA>      
      ##  4              3499 Wells, P… <NA>        1868      1929 <NA>        Wells, P.…
      ##  5              4952 Wells, J… Wells…      1855      1929 <NA>        <NA>      
      ##  6              5765 Wells-Ba… <NA>        1862      1931 http://en.… <NA>      
      ##  7              7102 Wells, F… <NA>        1874      1929 <NA>        <NA>      
      ##  8             32091 Reeder, … <NA>        1884        NA <NA>        <NA>      
      ##  9             32113 Wells, S… <NA>        1820      1875 <NA>        Wells, Sa…
      ## 10             32327 Wells, H… <NA>        1900        NA <NA>        <NA>      
      ## 11             33148 Wells, B… Wells…      1912      2003 <NA>        <NA>      
      ## 12             33381 Wells, D… Wells…      1828      1898 <NA>        <NA>      
      ## 13             34869 Wells, D… <NA>        1868      1900 <NA>        <NA>      
      ## 14             36067 Wells, N… <NA>          NA        NA <NA>        <NA>      
      ## 15             37054 Wells, K… <NA>        1838      1911 <NA>        Wells, Ca…
      ## 16             39127 Williams… <NA>        1888      1945 <NA>        <NA>      
      ## 17             39415 Wells, A… Wells…      1862      1933 <NA>        <NA>      
      ## 18             39662 Smith, M… <NA>        1840      1930 <NA>        Smith, Ma…
      ## 19             41468 Rogers, … <NA>        1873        NA <NA>        <NA>
      gutenberg_works(gutenberg_author_id == 30) %>%
        arrange(title) %>% 
        mutate(stitle = str_trunc(title,40)) %>% 
        select(stitle, gutenberg_id) 
      ## # A tibble: 54 x 2
      ##    stitle                                       gutenberg_id
      ##    <chr>                                               <int>
      ##  1 "A Modern Utopia"                                    6424
      ##  2 "A Short History of the World"                      35461
      ##  3 "An Englishman Looks at the World\r\nBei..."        11502
      ##  4 "Ann Veronica: A Modern Love Story"                   524
      ##  5 "Anticipations\r\nOf the Reaction of Mec..."        19229
      ##  6 "Boon, The Mind of the Race, The Wild ..."          34962
      ##  7 "Certain Personal Matters"                          17508
      ##  8 "First and Last Things: A Confession o..."           4225
      ##  9 "Floor Games; a companion volume to \"L..."          3690
      ## 10 "God, the Invisible King"                            1046
      ## # … with 44 more rows

2.5.2 Convert the texts for Wells and Bronte into their own data frames in Tidy Text format, without Stop Words

  • Wells IDs are 35, 36, 159, and 5230
  • Bronte IDs are 767, 768, 969, 1260, and 9182

    hgwells <- gutenberg_download(c(35, 36, 159, 5230))
    bronte <- gutenberg_download(c(767, 768, 969, 1260, 9182))
    
    tidy_hgwells <- hgwells %>%
      unnest_tokens(word, text) %>%
      mutate(word = str_extract(word, "[a-z']+")) %>%
      anti_join(stop_words)
    tidy_bronte <- bronte %>%
      unnest_tokens(word, text) %>%
      mutate(word = str_extract(word, "[a-z']+")) %>%
      anti_join(stop_words)
    
    tidy_hgwells %>%
      count(word, sort = TRUE)
    ## # A tibble: 11,648 x 2
    ##    word       n
    ##    <chr>  <int>
    ##  1 time     454
    ##  2 people   302
    ##  3 door     260
    ##  4 heard    249
    ##  5 black    232
    ##  6 stood    229
    ##  7 white    222
    ##  8 hand     218
    ##  9 kemp     213
    ## 10 eyes     210
    ## # … with 11,638 more rows
    tidy_bronte %>%
      count(word, sort = TRUE)
    ## # A tibble: 22,516 x 2
    ##    word       n
    ##    <chr>  <int>
    ##  1 time    1065
    ##  2 miss     856
    ##  3 day      828
    ##  4 hand     768
    ##  5 eyes     713
    ##  6 night    647
    ##  7 heart    638
    ##  8 looked   602
    ##  9 door     592
    ## 10 half     588
    ## # … with 22,506 more rows

2.5.3 Put all three authors together in one tibble with new columns for author and proportion of words.

  • Create a new variable in each author’s data frame with the author’s name
  • Combine into a dataframe
  • Clean out any formatting to get just the words
  • Get the word counts for each author
  • Create a word proportion for each author

    bind_rows(mutate(tidy_bronte, author = "Bronte"),
              mutate(tidy_hgwells, author = "Wells"),
              mutate(tidy_books, author = "Austen")) %>%
      mutate(word = str_extract(word, "[a-z']+")) %>%
      count(author, word) %>%
      group_by(author) %>%
        mutate(proportion = n / sum(n)) %>%
        select(-n) -> freq_by_author_by_word
    arrange(freq_by_author_by_word,word)
    ## # A tibble: 47,628 x 3
    ## # Groups:   author [3]
    ##    author word      proportion
    ##    <chr>  <chr>          <dbl>
    ##  1 Bronte a'most    0.0000160 
    ##  2 Austen a'n't     0.00000462
    ##  3 Bronte aback     0.00000400
    ##  4 Wells  aback     0.0000150 
    ##  5 Bronte abaht     0.00000400
    ##  6 Bronte abandon   0.0000320 
    ##  7 Wells  abandon   0.0000150 
    ##  8 Austen abandoned 0.00000462
    ##  9 Bronte abandoned 0.0000920 
    ## 10 Wells  abandoned 0.000180  
    ## # … with 47,618 more rows
  • At this point we have the data by author for each word.
  • Now we can pivot wider to break out each author and then pivot longer with Bronte and Wells to be able to compare each to Austen

    freq_by_author_by_word %>% 
        pivot_wider(names_from = author, values_from = proportion) ->
    frequency_by_word_across_authors
    frequency_by_word_across_authors
    ## # A tibble: 28,678 x 4
    ##    word          Austen      Bronte      Wells
    ##    <chr>          <dbl>       <dbl>      <dbl>
    ##  1 a'n't     0.00000462 NA          NA        
    ##  2 abandoned 0.00000462  0.0000920   0.000180 
    ##  3 abashed   0.00000462  0.0000160  NA        
    ##  4 abate     0.00000924  0.0000120  NA        
    ##  5 abatement 0.0000185  NA          NA        
    ##  6 abating   0.00000462  0.00000800 NA        
    ##  7 abbey     0.000328   NA           0.0000300
    ##  8 abbeyland 0.00000462 NA          NA        
    ##  9 abbeys    0.00000924 NA          NA        
    ## 10 abbots    0.00000462 NA          NA        
    ## # … with 28,668 more rows
    frequency_by_word_across_authors %>%
      pivot_longer(Bronte:Wells, names_to = "author", names_ptypes = list(factor()), 
                   values_to = "proportion") ->
      frequency
    
    arrange(frequency, word)
    ## # A tibble: 57,356 x 4
    ##    word         Austen author  proportion
    ##    <chr>         <dbl> <chr>        <dbl>
    ##  1 a'most  NA          Bronte  0.0000160 
    ##  2 a'most  NA          Wells  NA         
    ##  3 a'n't    0.00000462 Bronte NA         
    ##  4 a'n't    0.00000462 Wells  NA         
    ##  5 aback   NA          Bronte  0.00000400
    ##  6 aback   NA          Wells   0.0000150 
    ##  7 abaht   NA          Bronte  0.00000400
    ##  8 abaht   NA          Wells  NA         
    ##  9 abandon NA          Bronte  0.0000320 
    ## 10 abandon NA          Wells   0.0000150 
    ## # … with 57,346 more rows

2.5.4 Graph the proportions of word usage of Wells and Bronte against Jane Austen.

  • Use library scales to help customize the plot

    library(scales) 
    frequency %>% ggplot(aes(x = proportion, 
              y = `Austen`, 
              color = abs(`Austen` - proportion))) +
      geom_abline(color = "gray40", lty = 2) +
      geom_jitter(alpha = 0.1, size = 2.5, 
                  width = 0.3, height = 0.3) +
      geom_text(aes(label = word), 
                check_overlap = TRUE, vjust = 1.5) +
      scale_x_log10(labels = percent_format()) +
      scale_y_log10(labels = percent_format()) +
      scale_color_gradient(limits = c(0, 0.001), 
                           low = "darkslategray4",
                           high = "gray75") +
      facet_wrap(~author, ncol = 2) +
      theme(legend.position="none") +
      labs(y = "Jane Austen", x = NULL)
  • We can tell Austen and Bronte are more similar (grouped closer to the line) than Austen and Wells.
  • Let’s use a correlation test to quantify the similarity.

    df_Bronte <- frequency[frequency$author == "Bronte",]
    #head(df_Bronte)
    cor.test(data = df_Bronte,  ~ proportion + `Austen`)
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  proportion and Austen
    ## t = 119.07, df = 10299, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  0.7528218 0.7690770
    ## sample estimates:
    ##       cor 
    ## 0.7610689
    df_Wells <- frequency[frequency$author == "Wells",]
    #head(df_Wells)
    cor.test(data = df_Wells,  ~ proportion + `Austen`)
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  proportion and Austen
    ## t = 36.296, df = 6010, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  0.4030622 0.4445345
    ## sample estimates:
    ##       cor 
    ## 0.4240206

3 Sentiment Analysis

3.1 Intro

  • When humans read text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.
    • Especially when authors are “showing, not telling” the emotional context
  • Sentiment Analysis (also known as opinion mining) uses computer-based text analysis, among other methods, to identify, extract, quantify, and study affective states and subjective information from text.
    • Commonly used by businesses to analyze customer comments on products or services.
  • Naive approach: get the sentiment of each word and add the sentiments up for a given amount of text.
    • This approach does not take into account word qualifiers like not, never, always, etc.
  • Generally, if we add up over many paragraphs, the positive and negative words will cancel each other out.
  • So, we are usually better off adding either by sentence or by paragraph.
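The naive approach can be sketched as follows (assuming the AFINN lexicon has already been downloaded via the textdata package, and using two invented sentences):

```r
library(dplyr)
library(tidytext)

# Two invented sentences to score
sentences <- tibble(
  sentence = 1:2,
  text = c("what a wonderful, happy day",
           "a terrible, sad mistake")
)

# Naive scoring: sum the AFINN value of each word within a sentence;
# words not in the lexicon simply drop out of the inner join
sentences %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(sentence) %>%
  summarize(score = sum(value))
```

Note that a negation such as “not happy” would still score positively here, which is exactly the qualifier problem mentioned above.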

3.2 Sentiment Lexicons Assign Sentiments to Words (based on “common” usage)

  • There are several sentiment lexicons - here are three examples:
    • AFINN from Finn Arup Nielsen,
    • bing from Bing Liu and collaborators
    • nrc from Saif Mohammad and Peter Turney
    sentiments %>% arrange(word)
    ## # A tibble: 6,786 x 2
    ##    word        sentiment
    ##    <chr>       <chr>    
    ##  1 2-faces     negative 
    ##  2 abnormal    negative 
    ##  3 abolish     negative 
    ##  4 abominable  negative 
    ##  5 abominably  negative 
    ##  6 abominate   negative 
    ##  7 abomination negative 
    ##  8 abort       negative 
    ##  9 aborted     negative 
    ## 10 aborts      negative 
    ## # … with 6,776 more rows
    get_sentiments("afinn")
    ## # A tibble: 2,477 x 2
    ##    word       value
    ##    <chr>      <dbl>
    ##  1 abandon       -2
    ##  2 abandoned     -2
    ##  3 abandons      -2
    ##  4 abducted      -2
    ##  5 abduction     -2
    ##  6 abductions    -2
    ##  7 abhor         -3
    ##  8 abhorred      -3
    ##  9 abhorrent     -3
    ## 10 abhors        -3
    ## # … with 2,467 more rows
    get_sentiments("bing")
    ## # A tibble: 6,786 x 2
    ##    word        sentiment
    ##    <chr>       <chr>    
    ##  1 2-faces     negative 
    ##  2 abnormal    negative 
    ##  3 abolish     negative 
    ##  4 abominable  negative 
    ##  5 abominably  negative 
    ##  6 abominate   negative 
    ##  7 abomination negative 
    ##  8 abort       negative 
    ##  9 aborted     negative 
    ## 10 aborts      negative 
    ## # … with 6,776 more rows
    get_sentiments("nrc")
    ## # A tibble: 13,901 x 2
    ##    word        sentiment
    ##    <chr>       <chr>    
    ##  1 abacus      trust    
    ##  2 abandon     fear     
    ##  3 abandon     negative 
    ##  4 abandon     sadness  
    ##  5 abandoned   anger    
    ##  6 abandoned   fear     
    ##  7 abandoned   negative 
    ##  8 abandoned   sadness  
    ##  9 abandonment anger    
    ## 10 abandonment fear     
    ## # … with 13,891 more rows
    unique(get_sentiments("nrc")$sentiment)
    ##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
    ##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"

3.2.1 Fear Example

  • Since the nrc lexicon gives us emotions, we can look at words labelled as just “fear” if we choose.
  • Get the Jane Austen books into tidy text format

    austen_books() %>%
      group_by(book) %>%
      mutate(linenumber = row_number(),
             chapter = cumsum(str_detect(text, 
                        regex("^chapter [\\divxlc]",
                       ignore_case = TRUE)))) %>%
      ungroup() %>%
      # use `word` so the inner_join will match with the nrc lexicon
      unnest_tokens(word, text) ->
      tidy_books
    tidy_books
    ## # A tibble: 725,055 x 4
    ##    book                linenumber chapter word       
    ##    <fct>                    <int>   <int> <chr>      
    ##  1 Sense & Sensibility          1       0 sense      
    ##  2 Sense & Sensibility          1       0 and        
    ##  3 Sense & Sensibility          1       0 sensibility
    ##  4 Sense & Sensibility          3       0 by         
    ##  5 Sense & Sensibility          3       0 jane       
    ##  6 Sense & Sensibility          3       0 austen     
    ##  7 Sense & Sensibility          5       0 1811       
    ##  8 Sense & Sensibility         10       1 chapter    
    ##  9 Sense & Sensibility         10       1 1          
    ## 10 Sense & Sensibility         13       1 the        
    ## # … with 725,045 more rows
  • Filter out all words from the nrc lexicon that are not “fear” words
  • Use an inner_join() to select the words in Emma that are “fear” words

    nrcfear <- get_sentiments("nrc") %>%
      filter(sentiment == "fear")
    
    tidy_books %>%
      filter(book == "Emma") %>%
      inner_join(nrcfear) %>%
      count(word, sort = TRUE)
    ## # A tibble: 364 x 2
    ##    word         n
    ##    <chr>    <int>
    ##  1 doubt       98
    ##  2 ill         72
    ##  3 afraid      65
    ##  4 marry       63
    ##  5 change      61
    ##  6 bad         60
    ##  7 feeling     56
    ##  8 bear        52
    ##  9 creature    39
    ## 10 obliging    34
    ## # … with 354 more rows
  • Notice it is not always clear why a word is a “fear” word.

3.2.2 Exercise: Plot the number of fear words in each chapter for each book.

  • Consider using scales = "free_x" in facet_wrap()

  • How many words are associated with the other sentiments in nrc?

    get_sentiments("nrc") %>%
      group_by(sentiment) %>%
      count()
    ## # A tibble: 10 x 2
    ## # Groups:   sentiment [10]
    ##    sentiment        n
    ##    <chr>        <int>
    ##  1 anger         1247
    ##  2 anticipation   839
    ##  3 disgust       1058
    ##  4 fear          1476
    ##  5 joy            689
    ##  6 negative      3324
    ##  7 positive      2312
    ##  8 sadness       1191
    ##  9 surprise       534
    ## 10 trust         1231

3.3 Looking at Larger Blocks of Text for Positive and Negative

  • Let’s create 80-line blocks from tidy_books and use the bing lexicon to categorize each word as positive or negative.
    • Recall the words in tidy_books are in sequential order by line number
    • Use inner_join() to filter out words not in bing and add the sentiment column
    • Create 80-line blocks of text using index = linenumber %/% 80 in count()
    • Pivot wider on sentiment to get the counts in separate columns (set missing values to 0)
    • Add a column with the net sentiment = positive - negative

      tidy_books %>%
        inner_join(get_sentiments("bing")) %>% 
        count(book, index = linenumber %/% 80, sentiment) %>% 
        pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n=0)) %>% 
        mutate(sentiment = positive - negative) ->
        janeaustensentiment
      
      janeaustensentiment %>%
        ggplot(aes(index, sentiment, fill = book)) +
        geom_col(show.legend = FALSE) +
        facet_wrap(~book, ncol = 2, scales = "free_x")

  • We should probably look at which words contribute to the positive and negative sentiment and be sure we want to include them as part of the sentiment.

    tidy_books %>%
      inner_join(get_sentiments("bing")) %>%
      count(word, sentiment, sort = TRUE) %>%
      ungroup() ->
      bing_word_counts
    
    bing_word_counts
    ## # A tibble: 2,585 x 3
    ##    word     sentiment     n
    ##    <chr>    <chr>     <int>
    ##  1 miss     negative   1855
    ##  2 well     positive   1523
    ##  3 good     positive   1380
    ##  4 great    positive    981
    ##  5 like     positive    725
    ##  6 better   positive    639
    ##  7 enough   positive    613
    ##  8 happy    positive    534
    ##  9 love     positive    495
    ## 10 pleasure positive    462
    ## # … with 2,575 more rows
  • Let’s plot the top ten for each sentiment

    bing_word_counts %>%
      group_by(sentiment) %>%
      top_n(10) %>% 
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~sentiment, scales = "free_y") +
      labs(y = "Contribution to sentiment",
           x = NULL) +
      coord_flip()
  • Not what we want for Jane Austen novels!! “Miss” is probably not a negative word here, but rather a title for a young, unmarried woman.

3.4 Two Approaches to Adjust an Improper Sentiment:

  • Take the word miss out of the data before doing the analysis (Add to the stop words) or
  • Change the sentiment lexicon to no longer have “miss” as a negative.

  • Remove miss from the text by adding to stop words and repeat

    custom_stop_words <- bind_rows(
      tibble(word = c("miss"),
             lexicon = c("custom")),
      stop_words)
    
    custom_stop_words
    ## # A tibble: 1,150 x 2
    ##    word        lexicon
    ##    <chr>       <chr>  
    ##  1 miss        custom 
    ##  2 a           SMART  
    ##  3 a's         SMART  
    ##  4 able        SMART  
    ##  5 about       SMART  
    ##  6 above       SMART  
    ##  7 according   SMART  
    ##  8 accordingly SMART  
    ##  9 across      SMART  
    ## 10 actually    SMART  
    ## # … with 1,140 more rows
    # Now, let's redo with the new stop words.
    austen_books() %>%
      group_by(book) %>%
      mutate(linenumber = row_number(),
             chapter = cumsum(str_detect(text, 
                        regex("^chapter [\\divxlc]",
                       ignore_case = TRUE)))) %>%
      ungroup() %>%
      # name the output column word so the inner_join will match the bing lexicon
      unnest_tokens(word, text) %>%
      anti_join(custom_stop_words) ->
      tidy_books_no_miss
    
    tidy_books_no_miss %>%
      inner_join(get_sentiments("bing")) %>%
      count(word, sentiment, sort = TRUE) %>%
      ungroup() ->
      bing_word_counts
    
    bing_word_counts
    ## # A tibble: 2,554 x 3
    ##    word      sentiment     n
    ##    <chr>     <chr>     <int>
    ##  1 happy     positive    534
    ##  2 love      positive    495
    ##  3 pleasure  positive    462
    ##  4 poor      negative    424
    ##  5 happiness positive    369
    ##  6 comfort   positive    292
    ##  7 doubt     negative    281
    ##  8 affection positive    272
    ##  9 perfectly positive    271
    ## 10 glad      positive    263
    ## # … with 2,544 more rows
    bing_word_counts %>%
      group_by(sentiment) %>%
      top_n(10) %>%
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~sentiment, scales = "free_y") +
      labs(y = "Contribution to sentiment",
           x = NULL) +
      coord_flip()

  • Approach 2: Remove the word “miss” from the bing sentiment lexicon.

    get_sentiments("bing") %>%
      filter(word != "miss") ->
    bing_no_miss
    
    tidy_books %>%
      inner_join(bing_no_miss) %>%
      count(word, sentiment, sort = TRUE) %>%
      ungroup() ->
      bing_word_counts
    
    bing_word_counts
    ## # A tibble: 2,584 x 3
    ##    word     sentiment     n
    ##    <chr>    <chr>     <int>
    ##  1 well     positive   1523
    ##  2 good     positive   1380
    ##  3 great    positive    981
    ##  4 like     positive    725
    ##  5 better   positive    639
    ##  6 enough   positive    613
    ##  7 happy    positive    534
    ##  8 love     positive    495
    ##  9 pleasure positive    462
    ## 10 poor     negative    424
    ## # … with 2,574 more rows
    # visualize it
    bing_word_counts %>%
      group_by(sentiment) %>%
      top_n(10) %>%
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~sentiment, scales = "free_y") +
      labs(y = "Contribution to sentiment",
           x = NULL) +
      coord_flip()

3.4.1 Repeat the Sentiment Trajectory Plot

  • Original and No Miss

    # Original
    tidy_books %>%
      inner_join(get_sentiments("bing")) %>% 
      count(book, index = linenumber %/% 80, sentiment) %>% 
      pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n=0)) %>% 
      mutate(sentiment = positive - negative) ->
      janeaustensentiment
    
    janeaustensentiment %>%
      ggplot(aes(index, sentiment, fill = book)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~book, ncol = 2, scales = "free_x")

    #No Miss
    tidy_books %>%
      inner_join(bing_no_miss) %>% 
      count(book, index = linenumber %/% 80, sentiment) %>% 
      pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n=0)) %>% 
      mutate(sentiment = positive - negative) ->
      janeaustensentiment
    
    janeaustensentiment %>%
      ggplot(aes(index, sentiment, fill = book)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~book, ncol = 2, scales = "free_x")

3.5 WordCloud plots

  • The wordcloud package uses base R graphics to create Word Clouds

    library(wordcloud)
    tidy_books %>%
      anti_join(stop_words) %>%
      count(word) %>%
      with(wordcloud(word, n, max.words = 100))

    library(reshape2)
    
    tidy_books %>%
      inner_join(bing_no_miss) %>%
      count(word, sentiment, sort = TRUE) %>%
      # acast() from reshape2 casts the counts into a word x sentiment matrix,
      # which comparison.cloud() expects
      acast(word ~ sentiment, value.var = "n", fill = 0) %>%
      comparison.cloud(colors = c("red", "blue"),
                       max.words = 100)
  • You can make them, but should you?
  • Consider the ChatterPlot

  • Try a repeat of top 50 Jane Austen words by sentiment and books

    library(ggrepel)
    tidy_books %>% 
      inner_join(bing_no_miss) %>% 
      count(book, word, sentiment, sort = TRUE) %>%
      mutate(proportion = n/sum(n)) %>% 
      group_by(sentiment) %>% 
      top_n(50) %>% 
      ungroup()-> 
      tempp
    tempp %>% 
      ggplot(aes(book, proportion, label = word)) +
      # ggrepel geom, make arrows transparent, color by rank, size by n
      geom_text_repel(segment.alpha = 0, 
                  aes(colour=sentiment, size=proportion)) +
      # set word size range & turn off legend
      scale_size_continuous(range = c(3, 6), guide = "none") +
      theme(axis.text.x = element_text(angle = 90)) +
      ggtitle("Top 50 Words by Sentiment in Each Book")

3.6 Looking at Larger Groups of Words

  • The sentiment analysis we just did was based on single words, so it did not account for modifiers such as “not”, which tend to flip the sentiment of the word that follows.
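
  • One common way to catch these negations is to tokenize into bigrams and check which bing sentiment words follow “not”. A minimal sketch (an aside, not part of the analysis below; the name not_words is introduced here):

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)

# Tokenize into bigrams, split each into its two words,
# and keep the bigrams whose first word is "not"
austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  # score the second word with bing: a unigram analysis
  # counts these contributions with the wrong sign
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
  count(word2, sentiment, sort = TRUE) ->
  not_words

not_words
```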

3.6.1 Sentence Example

  • Consider the data set prideprejudice which has the complete text divided into elements of up to about 70 characters each.
  • If the unit for tokenizing is ngrams, skip_ngrams, sentences, lines, paragraphs, or regex, unnest_tokens() will collapse the entire input together before tokenizing unless collapse = FALSE.
  • Let’s add a chapter variable and also add a period after the chapter number so the heading becomes its own sentence
  • unnest_tokens() splits sentences at periods, so as a small clean-up, remove the periods after Mr., Mrs., and Dr.

    tibble(text = prideprejudice) %>% 
      mutate(chapter = cumsum(str_detect(text, 
                     regex("^chapter [\\divxlc]", ignore_case = TRUE))),
             text = str_replace(text, "(Chapter \\d+)","\\1\\."),
             text = str_replace_all(text, "((Mr)|(Mrs)|(Dr))\\.","\\1")) %>% 
      unnest_tokens(sentence, text, token = "sentences") ->
      PandP_sentences
  • Add sentence numbers and unnest at the word level
  • Add sentiments
  • Get rid of the cover page (Chapter 0)
  • Count the number of positive and negative words per sentence
  • Pivot wider to break out the sentiments
  • Create a score for each sentence: 1 if it has more positive words than negative, 0 if the counts are equal, and -1 if it has more negative words than positive
  • Summarize by chapter as the total score divided by the number of sentences in the chapter
  • Create a line plot of sentiment score by chapter to see a view of the story arc

    PandP_sentences %>% 
      mutate(sentence_number = row_number()) %>% 
      unnest_tokens(word,sentence) %>% 
      inner_join(get_sentiments("bing")) %>% 
      filter(chapter >0) %>% 
      count(chapter, sentence_number, sentiment) %>% 
      pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>% 
      mutate(sentence_sent = positive - negative) %>%  
      mutate(sentence_sent = case_when(
        sentence_sent > 0  ~  1,
        sentence_sent == 0 ~  0,
        sentence_sent < 0  ~ -1
      )) %>%
      group_by(chapter) %>% 
      summarize(chap_sent_per = sum(sentence_sent)/n()) %>% 
      ggplot(aes(chapter, chap_sent_per)) +
      geom_line()+
      ggtitle("Sentence Sentiment Score per Chapter") +
      ylab("Score / Total Sentences in a Chapter") +
      xlab("Chapter") +
      geom_hline(yintercept = 0, color = "red", alpha = .4, lty = 2) +
      scale_x_continuous(limits = c(1,61)) + 
      geom_rug(sides = "b")
  • Chapter 36 appears to be the low point.
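
  • Rather than reading the low point off the plot, the per-chapter summary could be passed to slice_min(). A toy sketch with made-up scores (the values below are illustrative, not computed from the text):

```r
library(dplyr)

# Illustrative per-chapter scores (made up, NOT computed from the novel)
chapter_scores <- tibble(
  chapter       = c(34, 35, 36, 37),
  chap_sent_per = c(0.10, -0.05, -0.40, 0.02))

# slice_min() keeps the row(s) with the smallest score
lowest <- chapter_scores %>% slice_min(chap_sent_per, n = 1)
lowest
```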

3.6.2 An Alternative at the Chapter Level

  • Consider all the Austen books
  • Look for the most negative chapter in each book, normalizing by the number of words in the chapter
  • Take out the word “miss”

    get_sentiments("bing") %>% 
      filter(sentiment == "negative") %>% 
      filter(word != "miss")-> 
      bingnegative
    
    tidy_books %>%
      group_by(book, chapter) %>%
      summarize(words = n()) ->
      wordcounts
    
    tidy_books %>%
      semi_join(bingnegative) %>%
      group_by(book, chapter) %>%
      summarize(negativewords = n()) %>%
      left_join(wordcounts, by = c("book", "chapter")) %>%
      mutate(ratio = negativewords/words) %>%
      filter(chapter != 0) %>%
      top_n(1) %>%
      ungroup()
    ## # A tibble: 6 x 5
    ##   book                chapter negativewords words  ratio
    ##   <fct>                 <int>         <int> <int>  <dbl>
    ## 1 Sense & Sensibility      43           156  3405 0.0458
    ## 2 Pride & Prejudice        34           111  2104 0.0528
    ## 3 Mansfield Park           46           161  3685 0.0437
    ## 4 Emma                     16            81  1894 0.0428
    ## 5 Northanger Abbey         21           143  2982 0.0480
    ## 6 Persuasion                4            62  1807 0.0343
  • These are the chapters with the highest proportion of negative words in each book. What is happening in these chapters?
    • In Chapter 43 of Sense and Sensibility Marianne is seriously ill, near death
    • In Chapter 34 of Pride and Prejudice, Mr. Darcy proposes for the first time (so badly!).
    • Chapter 46 of Mansfield Park is almost the end, when everyone learns of Henry’s scandalous adultery.
    • Chapter 16 of Emma is when, back at Hartfield after her ride with Mr. Elton, Emma plunges into self-recrimination as she looks back over the past weeks.
    • Chapter 21 of Northanger Abbey is when Catherine is deep in her Gothic faux fantasy of murder, etc.
    • Chapter 4 of Persuasion is when the reader gets the full flashback of Anne refusing Captain Wentworth and how sad she was and what a terrible mistake she realized it to be.